
Conversation

kevalmorabia97 (Collaborator) commented Sep 18, 2025

What does this PR do?

  • Enforce the llm_ptq example test to pass on every PR
  • Don't pre-install example dependencies in the docker image; install them at runtime instead, since we see many conflicting dependencies as the number of examples grows
  • Remove torchvision as a default dependency and add it only in the examples that need it. We've seen customer complaints where users have torch but not torchvision, and modelopt then installs torchvision, overwriting a pre-installed source-compiled torch (see the sketch below).
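
A minimal sketch of the setup.py change described in the last bullet (illustrative only; the real file may organize its dependency lists differently):

# Sketch: torchvision is no longer a base dependency; it moves to the dev-test
# extra, and examples that need it declare it in their own requirements.txt.
required_deps = [
    # base runtime dependencies only; torchvision intentionally removed from here
]

optional_deps = {
    "dev-test": [
        "torchvision",  # installed via e.g. `pip install "nvidia-modelopt[dev-test]"`
    ],
}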

Testing

Summary by CodeRabbit

  • Documentation

    • Added detailed Linux system requirements, updated docker guidance, and switched clone instructions to SSH.
  • Chores

    • Docker now installs from the local source, expands CUDA architecture support, and simplifies example dependency handling.
    • Moved torchvision from base to dev-test and added it to several example requirements and a notebook install step.
  • Tests

    • New GPU-enabled E2E example tests with change-based gating, PR/non-PR runs, and required-check gating.
    • Adjusted CI trigger rules to ignore pyproject.toml and increased several workflow timeouts.
    • Removed automatic Transformers <4.50 installation from a test.


coderabbitai bot commented Sep 18, 2025

Walkthrough

Adds a new GPU-enabled E2E GitHub Actions workflow with PR change detection and gating, adjusts CI triggers/timeouts, switches Docker builds to install the local repository and expands CUDA archs, moves torchvision into example requirements and dev-test extras, changes clone instructions to SSH, and removes an automated transformers pin in a test.

Changes

Cohort / File(s) Summary of Changes
E2E example tests workflow
.github/workflows/example_tests.yml
New "E2E Example tests" workflow with change-detection (check-file-changes), prereq waiter (wait-checks), PR and non-PR GPU jobs (example-tests-pr, example-tests-non-pr) using an NVIDIA container (matrix: llm_ptq), and a required-check gate (example-pr-required-check).
GitHub workflow tweaks
.github/workflows/gpu_tests.yml, .github/workflows/unit_tests.yml, .github/workflows/code_quality.yml, .github/workflows/pages.yml
Removed pyproject.toml from change/trigger path filters (gpu_tests, unit_tests); increased code_quality and pages job timeouts from 15→30 minutes.
GitLab CI alignment
.gitlab/tests.yml
Renamed the matrix variable to EXAMPLE; per-example paths/commands now use EXAMPLE; installs per-example requirements when present; updated tag selection (sm<89 / sm>=89).
Docker build (root image)
docker/Dockerfile
Replaced ARG PIP_EXTRA_INDEX_URL with an ENV literal, expanded TORCH_CUDA_ARCH_LIST (added 12.0+PTX), copied the local repo into the image and installed -e "./TensorRT-Model-Optimizer[all,dev-test]", precompiled extensions in-image, removed .git; removed dynamic per-example requirement discovery.
ONNX PTQ Docker build
examples/onnx_ptq/docker/Dockerfile
Updated COPY and pip install paths from modelopt/* to TensorRT-Model-Optimizer/* and adjusted editable install to -e "./TensorRT-Model-Optimizer[hf,onnx]".
Docs and README install instructions
docs/source/getting_started/_installation_for_Linux.rst, README.md
Added a "System requirements" section; adjusted wording about preinstalled deps in Docker images; changed clone instruction from HTTPS to SSH (git clone [email protected]:NVIDIA/TensorRT-Model-Optimizer.git).
Dependency reshaping: torchvision
examples/cnn_qat/requirements.txt, examples/llm_eval/requirements.txt, examples/onnx_ptq/requirements.txt, setup.py, examples/pruning/cifar_resnet.ipynb
Added torchvision to multiple example requirements and a notebook install cell; removed torchvision from setup.py base required_deps and added it to optional_deps["dev-test"].
Tests cleanup
tests/examples/speculative_decoding/test_medusa.py
Removed a pytest fixture that ran pip install transformers<4.50 before tests; rest of test file unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Dev as Developer
  participant GH as GitHub
  participant WF as "E2E Example tests"
  participant CF as check-file-changes
  participant WC as wait-checks
  participant PRJ as example-tests-pr
  participant NPJ as example-tests-non-pr
  participant RC as example-pr-required-check

  Dev->>GH: Push / PR / Manual dispatch
  GH->>WF: Trigger workflow
  rect rgba(230,240,255,0.5)
    note right of WF: Change detection gate
    WF->>CF: compute merge-base & detect changed files
    CF-->>WF: any_changed = true/false
  end

  alt any_changed == true
    rect rgba(235,255,235,0.5)
      note over WF,WC: Wait for prerequisite checks (DCO or linux)
      WF->>WC: await match_pattern ^DCO$|^linux$
      WC-->>WF: prereqs passed/failed
    end

    alt Context is PR and prereqs passed
      WF->>PRJ: run GPU container tests (matrix: llm_ptq)
      PRJ-->>WF: results
    else Context is non-PR
      WF->>NPJ: run GPU container tests (matrix: llm_ptq)
      NPJ-->>WF: results
    end
  else any_changed == false
    note right of WF: Skip example test jobs
  end

  rect rgba(255,245,230,0.6)
    note right of RC: Required-check gate enforces PR status
    WF->>RC: evaluate outcomes
    RC-->>GH: report pass/fail status
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I thump my paw—new tests take flight,
Workflows hum in GPU light.
Dockers stretch their CUDA arch,
Extras hop into the march.
I nibble docs and SSH key—CI blooms for me. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly and accurately summarizes the two primary changes in the PR: adding an llm_ptq PR test and cleaning up dependency installation in the modelopt Docker image; it directly relates to the workflow and Dockerfile changes and is clear to a reviewer scanning history.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.


codecov bot commented Sep 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.82%. Comparing base (b895dc5) to head (24db1b5).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #338   +/-   ##
=======================================
  Coverage   73.82%   73.82%           
=======================================
  Files         172      172           
  Lines       17438    17438           
=======================================
  Hits        12874    12874           
  Misses       4564     4564           

☔ View full report in Codecov by Sentry.

coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (8)
examples/pruning/cifar_resnet.ipynb (1)

25-33: Make the install cell idempotent and dev‑env friendly.

Avoid reinstalling over editable installs; gate on missing packages.

Apply:

-! pip install nvidia-modelopt torchvision
+import importlib.util, subprocess, sys
+missing = [m for m in ("modelopt", "torchvision") if importlib.util.find_spec(m) is None]
+if missing:
+    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "nvidia-modelopt", "torchvision"])
+else:
+    print("Dependencies already present; skipping install")
README.md (1)

69-73: Provide HTTPS clone as default and keep SSH as optional.

SSH requires keys; HTTPS works for most users.

- git clone [email protected]:NVIDIA/TensorRT-Model-Optimizer.git
+ # HTTPS (recommended for most users)
+ git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
+ # Or SSH (requires GitHub SSH keys)
+ # git clone [email protected]:NVIDIA/TensorRT-Model-Optimizer.git
examples/onnx_ptq/docker/Dockerfile (2)

8-12: Drop SSH host scan unless you actually fetch via SSH.

It adds network dependency and build variability.

-    && mkdir -p -m 0600 ~/.ssh \
-    && ssh-keyscan github.com >> ~/.ssh/known_hosts
+    # SSH known_hosts not needed since we don't fetch via SSH

28-32: Tighten pip steps and validate env — TensorRT Python not present

Running import tensorrt raised ModuleNotFoundError — the Python TensorRT package isn't installed in the current environment, so duplication with the base image was not observed.

File: examples/onnx_ptq/docker/Dockerfile Lines: 28-32

-RUN pip install -r TensorRT-Model-Optimizer/examples/onnx_ptq/requirements.txt
+RUN pip install --no-cache-dir -r TensorRT-Model-Optimizer/examples/onnx_ptq/requirements.txt && pip check
@@
-RUN pip install -e "./TensorRT-Model-Optimizer[hf,onnx]"
+RUN pip install --no-cache-dir -e "./TensorRT-Model-Optimizer[hf,onnx]" && pip check
docs/source/getting_started/_installation_for_Linux.rst (1)

40-41: Minor grammar/clarity.

"examples's" → "examples'"; also link to a concrete file for discoverability.

-    dependencies pre-installed. You may need to install additional dependencies from the examples's `requirements.txt` file.
+    dependencies pre-installed. You may need to install additional dependencies from the examples' `requirements.txt` file (e.g., ``examples/llm_ptq/requirements.txt``).
.github/workflows/example_tests.yml (3)

63-65: Runner label is custom; configure actionlint or add a comment.

actionlint warns on unknown label. If you lint this repo, add the label to actionlint.yaml or gate actionlint for this workflow.


75-81: Install order OK; consider caching to speed up.

You already use a proxy cache. Optionally add “pip install -U pip wheel” first and “--upgrade” to ensure fresh extras.

-          pip install ".[all]"
+          python -m pip install -U pip wheel
+          pip install -U ".[all]"

82-91: YAML anchors trip actionlint; inline container/steps for non‑PR job.

GitHub Actions supports anchors, but actionlint’s diagnostics indicate it flags alias usage for mapping nodes here. To keep lint green, duplicate the mapping and steps.

   example-tests-non-pr:
     if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
     # Runner list at https://github.com/nv-gha-runners/enterprise-runner-configuration/blob/main/docs/runner-groups.md
     runs-on: linux-amd64-gpu-h100-latest-1
     timeout-minutes: 90
     strategy:
       matrix:
         EXAMPLE: [llm_ptq]
-    container: *example_container
-    steps: *example_steps
+    container:
+      image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2
+      env:
+        LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}"
+        PATH: "/usr/local/tensorrt/targets/x86_64-linux-gnu/bin:${PATH}"
+        PIP_CONSTRAINT: ""
+    steps:
+      - uses: actions/checkout@v4
+      - uses: nv-gha-runners/setup-proxy-cache@main
+      - name: Run example tests
+        run: |
+          python -m pip install -U pip wheel
+          pip install -U ".[all]"
+          if [ -f examples/$EXAMPLE/requirements.txt ]; then pip install -r examples/$EXAMPLE/requirements.txt; fi
+          pytest -s tests/examples/$EXAMPLE
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 912f3dc and 3325c77.

📒 Files selected for processing (13)
  • .github/workflows/example_tests.yml (1 hunks)
  • .github/workflows/gpu_tests.yml (0 hunks)
  • .github/workflows/unit_tests.yml (0 hunks)
  • .gitlab/tests.yml (1 hunks)
  • README.md (1 hunks)
  • docker/Dockerfile (2 hunks)
  • docs/source/getting_started/_installation_for_Linux.rst (2 hunks)
  • examples/cnn_qat/requirements.txt (1 hunks)
  • examples/llm_eval/requirements.txt (1 hunks)
  • examples/onnx_ptq/docker/Dockerfile (1 hunks)
  • examples/onnx_ptq/requirements.txt (1 hunks)
  • examples/pruning/cifar_resnet.ipynb (1 hunks)
  • setup.py (1 hunks)
💤 Files with no reviewable changes (2)
  • .github/workflows/unit_tests.yml
  • .github/workflows/gpu_tests.yml
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/example_tests.yml

63-63: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


82-82: "steps" section is missing in job "example-tests-non-pr"

(syntax-check)


85-85: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


90-90: "container" section is alias node but mapping node is expected

(syntax-check)


91-91: "steps" section must be sequence node but got alias node with "" tag

(syntax-check)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (9)
setup.py (1)

81-82: LGTM: move torchvision out of base install.

This aligns with per‑example deps and reduces accidental torch churn from base installs.

Please confirm CI images satisfy torch>=2.6 to avoid resolver surprises when examples add torchvision.

examples/llm_eval/requirements.txt (1)

7-7: Keep torchvision — it's used in examples/llm_eval/modeling.py (import: from torchvision.datasets.utils import download_url at line 61).

Likely an incorrect or invalid review comment.

examples/onnx_ptq/requirements.txt (1)

5-5: Keep torchvision in examples/onnx_ptq/requirements.txt — it's required.

Imported in examples/onnx_ptq/evaluation.py (import torchvision.transforms as transforms; from torchvision.datasets import ImageNet) and examples/onnx_ptq/evaluate.py (from torchvision.datasets import ImageNet). Keep the dependency.

examples/cnn_qat/requirements.txt (1)

1-1: Pin torchvision to a torch‑compatible range (or verify in CI)

examples/cnn_qat/requirements.txt:1 lists unpinned torchvision. I ran the suggested verification but it failed with: ModuleNotFoundError: No module named 'torch' — verification couldn't be completed in this environment. Either pin torchvision to the Torch used in CI (to avoid CUDA/Torch mismatches) or re-run the check inside CI/an environment with torch installed and add a post-install pairing check.
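
A post-install pairing check could be a small script along these lines (a hedged sketch, not something in the repo; it only verifies that the installed torchvision's compiled ops load and run against the installed torch):

# Hypothetical smoke check: a torch/torchvision mismatch typically surfaces when
# torchvision's compiled C++ ops are loaded or executed.
import torch
import torchvision
from torchvision.ops import nms

print(f"torch {torch.__version__} / torchvision {torchvision.__version__}")

# Exercise a compiled torchvision op on CPU; incompatible builds usually fail here.
boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 11.0, 11.0]])
scores = torch.tensor([0.9, 0.8])
keep = nms(boxes, scores, iou_threshold=0.5)
assert keep.numel() >= 1, "NMS returned nothing; the torchvision install looks broken"
print("torch/torchvision pairing looks OK")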

docker/Dockerfile (1)

26-26: Precompilation step: good call.

Precompiling extensions avoids per-run latency. Keep.

Consider caching pip wheels in the image (or via a proxy cache) to further reduce CI runtimes.

docs/source/getting_started/_installation_for_Linux.rst (2)

15-26: Version matrix needs verification against current releases.

Please confirm Python (>=3.10,<3.13), CUDA (>=12.0), PyTorch (>=2.6), TensorRT-LLM (1.1.0rc2.post2), ONNX Runtime (1.22), TensorRT (>=10.0) are accurate minima and reflect what ModelOpt tests against today.

I can add a small “Tested with” table if you share the CI matrix.


165-168: Good: document precompilation.

This matches the Dockerfile step and sets user expectations.

.github/workflows/example_tests.yml (2)

69-74: Container env and PIP_CONSTRAINT reset look right.

Helps avoid pin conflicts when upgrading deps in job.


93-100: Required-check gate looks good.

Enforces failure when tests are expected but skipped/failed.

kevalmorabia97 force-pushed the kmorabia/dont-preinstall-docker-deps branch 2 times, most recently from f83a0fe to d44fb7e on September 18, 2025 15:05

coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
docker/Dockerfile (2)

21-27: Don’t preinstall example deps in the base image; it contradicts PR goals.

This bakes llm_ptq deps globally and can clash with user torch/torchvision installs. Install at runtime in workflows/examples only.

 RUN pip install -e "./TensorRT-Model-Optimizer[all,dev-test]"
 RUN rm -rf TensorRT-Model-Optimizer/.git
 RUN python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()"
-RUN pip install -r TensorRT-Model-Optimizer/examples/llm_ptq/requirements.txt

7-7: Fix invalid TORCH_CUDA_ARCH_LIST entry (“12.0+PTX”) and stray “10.0”.

PyTorch expects SM versions; "12.0+PTX" is not a valid SM, and adding unsupported SMs slows or breaks builds. Keep only supported SMs and add +PTX to the highest.

-    TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0 12.0+PTX"
+    TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0+PTX"

If you truly target SM 10.0, confirm toolchain support and use 10.0+PTX as the highest instead of 12.0+PTX.
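
To sanity-check the arch list against what the installed PyTorch build and the local GPU actually support, a small sketch like this can be run inside the image (assumes nothing beyond a working torch install):

# Sketch: compare the compute capabilities torch was compiled for with the GPU
# visible in this environment.
import torch

built_archs = torch.cuda.get_arch_list()  # e.g. ['sm_80', 'sm_86', 'sm_90', 'compute_90']
print("torch was built for:", built_archs)

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    gpu_arch = f"sm_{major}{minor}"
    print("local GPU is:", gpu_arch)
    if gpu_arch not in built_archs:
        print("WARNING: no native kernels for this GPU; execution would rely on PTX JIT, if any +PTX arch is present")
else:
    print("No CUDA device visible; only the build list above can be checked")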

🧹 Nitpick comments (3)
.github/workflows/code_quality.yml (1)

18-18: Timeout bump looks fine; consider pip caching to keep it under 30 mins.

Add setup-python’s pip cache to speed up tox/pre-commit runs and reduce flakiness.

 - uses: actions/setup-python@v5
   with:
     python-version: "3.12"
+    cache: 'pip'
+    cache-dependency-path: |
+      requirements*.txt
+      pyproject.toml
+      tox.ini
docker/Dockerfile (1)

3-6: Optional: enable pip no‑cache to reduce layer size.

You set PIP_NO_CACHE_DIR=off. Consider enabling for image builds.

-    PIP_NO_CACHE_DIR=off \
+    PIP_NO_CACHE_DIR=on \
.github/workflows/example_tests.yml (1)

71-74: Consider exporting PATH to TRT bin if tests need CLI tools.

If tests call trtexec or other TRT binaries, uncomment PATH; otherwise this is fine.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3325c77 and d44fb7e.

📒 Files selected for processing (8)
  • .github/workflows/code_quality.yml (1 hunks)
  • .github/workflows/example_tests.yml (1 hunks)
  • .github/workflows/gpu_tests.yml (0 hunks)
  • .github/workflows/pages.yml (1 hunks)
  • .github/workflows/unit_tests.yml (0 hunks)
  • .gitlab/tests.yml (1 hunks)
  • docker/Dockerfile (2 hunks)
  • tests/examples/speculative_decoding/test_medusa.py (0 hunks)
💤 Files with no reviewable changes (3)
  • tests/examples/speculative_decoding/test_medusa.py
  • .github/workflows/gpu_tests.yml
  • .github/workflows/unit_tests.yml
✅ Files skipped from review due to trivial changes (1)
  • .github/workflows/pages.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .gitlab/tests.yml
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/example_tests.yml

63-63: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


82-82: "steps" section is missing in job "example-tests-non-pr"

(syntax-check)


85-85: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


90-90: "container" section is alias node but mapping node is expected

(syntax-check)


91-91: "steps" section must be sequence node but got alias node with "" tag

(syntax-check)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (4)
.github/workflows/example_tests.yml (4)

79-81: Runtime example dep install matches PR intent. LGTM.

Installs project first, then per‑example requirements; aligns with avoiding baked deps.

If multiple requirements.txt files exist under an example, confirm install order doesn’t cause conflicts.


92-100: Required‑check guard is solid.

The gate fails the PR when relevant changes occur but GPU tests didn’t run/pass. Good safety net.


63-64: Confirm runner label exists or add it to actionlint config.

No actionlint config found in .github; workflows reference runs-on: linux-amd64-gpu-h100-latest-1 — ensure that self-hosted runner label exists in your org or add it to .github/actionlint.yaml to silence actionlint warnings.

Locations:

  • .github/workflows/example_tests.yml — runs-on: linux-amd64-gpu-h100-latest-1 (lines 63, 85)
  • .github/workflows/gpu_tests.yml — runs-on: linux-amd64-gpu-h100-latest-1 (line 79)

57-58: Gate waits verified — reusable workflow present and pattern matches job names.

  • .github/workflows/_wait_for_checks.yml exists and declares workflow_call inputs: match_pattern and delay.
  • example_tests.yml, gpu_tests.yml and unit_tests.yml call it with match_pattern '^DCO$|^linux$' (unit_tests defines the linux job and a check-dco that matches ^DCO$).
  • _wait_for_checks.yml does not export outputs — not required here because callers only use needs to depend on the job, not consume outputs.

kevalmorabia97 force-pushed the kmorabia/dont-preinstall-docker-deps branch from d44fb7e to c01a56f on September 18, 2025 16:21

coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (4)
docker/Dockerfile (3)

7-7: Fix TORCH_CUDA_ARCH_LIST; "12.0+PTX" is not a valid SM and only the highest SM should have +PTX.
Use SMs (8.0/8.6/8.7/8.9/9.0). Add 10.0 only if your toolchain supports it.

-    TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0 12.0+PTX"
+    TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0+PTX"
+# If SM 10.0 is supported end-to-end, use:
+# TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0+PTX"

27-27: Remove baked llm_ptq requirements; install at runtime in workflows/examples.
This contradicts “install example deps at runtime” and can clash with user torch builds.

-RUN pip install -r TensorRT-Model-Optimizer/examples/llm_ptq/requirements.txt
+# Example-specific dependencies are installed at runtime in CI or example entrypoints.

29-30: Avoid chmod 777; switch to non-root user with proper ownership.
777 is a security smell and breaks umask expectations.

-# Allow users to run without root
-RUN chmod -R 777 /workspace
+# Allow users to run without root (least privilege)
+RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /workspace
+USER appuser
.github/workflows/example_tests.yml (1)

82-91: Inline anchors for container/steps in non‑PR job; actionlint flags alias/syntax errors.
Replace YAML aliases with concrete blocks.

   example-tests-non-pr:
     if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
     # Runner list at https://github.com/nv-gha-runners/enterprise-runner-configuration/blob/main/docs/runner-groups.md
     runs-on: linux-amd64-gpu-h100-latest-1
     timeout-minutes: 90
     strategy:
       matrix:
         EXAMPLE: [llm_ptq]
-    container: *example_container
-    steps: *example_steps
+    container:
+      image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2
+      env:
+        LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib"
+        PIP_CONSTRAINT: ""
+    steps:
+      - uses: actions/checkout@v4
+      - uses: nv-gha-runners/setup-proxy-cache@main
+      - name: Run example tests
+        run: |
+          pip install ".[all,dev-test]"
+          find examples/$EXAMPLE -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done
+          pytest -s tests/examples/$EXAMPLE
🧹 Nitpick comments (2)
.github/workflows/pages.yml (1)

26-26: Timeout bump acknowledged; add pip cache to reduce need for longer timeouts.
Caching pip deps will speed up tox builds and may let you keep the old timeout.

-      - uses: actions/setup-python@v5
-        with:
-          python-version: "3.12"
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+          cache: pip
+          cache-dependency-path: tox.ini
.github/workflows/example_tests.yml (1)

71-74: Env var expansion won’t happen here; LD_LIBRARY_PATH will literally contain “${LD_LIBRARY_PATH}”.
Set the full value explicitly or export in a run step.

-        LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}"
+        LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d44fb7e and c01a56f.

📒 Files selected for processing (8)
  • .github/workflows/code_quality.yml (1 hunks)
  • .github/workflows/example_tests.yml (1 hunks)
  • .github/workflows/gpu_tests.yml (0 hunks)
  • .github/workflows/pages.yml (1 hunks)
  • .github/workflows/unit_tests.yml (0 hunks)
  • .gitlab/tests.yml (1 hunks)
  • docker/Dockerfile (2 hunks)
  • tests/examples/speculative_decoding/test_medusa.py (0 hunks)
💤 Files with no reviewable changes (3)
  • .github/workflows/gpu_tests.yml
  • tests/examples/speculative_decoding/test_medusa.py
  • .github/workflows/unit_tests.yml
🚧 Files skipped from review as they are similar to previous changes (2)
  • .gitlab/tests.yml
  • .github/workflows/code_quality.yml
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/example_tests.yml

63-63: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


82-82: "steps" section is missing in job "example-tests-non-pr"

(syntax-check)


85-85: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


90-90: "container" section is alias node but mapping node is expected

(syntax-check)


91-91: "steps" section must be sequence node but got alias node with "" tag

(syntax-check)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (1)
.github/workflows/example_tests.yml (1)

79-81: CI installs example deps at runtime (good); align Dockerfile accordingly.
Keep this pattern; remove baked example deps and dev-test extras from the base image as suggested.

pip install -r "$req_file" || exit 1; \
done
RUN python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()"
RUN pip install -r TensorRT-Model-Optimizer/examples/llm_ptq/requirements.txt

Collaborator

maybe we can skip this too. If we start with the TRT LLM docker, most likely the dependencies are met

kevalmorabia97 (Collaborator, Author) commented Sep 18, 2025

Sure. Maybe we can reconsider our docker strategy. Should we just not provide a separate dockerfile and directly use a trtllm docker?
The only change on top of the trtllm docker is setting some env vars (for ONNX TensorRT) and installing libgl1, which some other examples may need and whose absence was previously causing import errors.

Collaborator

That may work as well.

Collaborator Author

I'll work on dockerfile removal as a separate PR after merging this

ChenhanYu (Collaborator) left a comment

LGTM;

kevalmorabia97 force-pushed the kmorabia/dont-preinstall-docker-deps branch from c01a56f to 11835b1 on September 18, 2025 20:57
kevalmorabia97 force-pushed the kmorabia/dont-preinstall-docker-deps branch from 11835b1 to b423dbb on September 18, 2025 21:02

coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
.github/workflows/example_tests.yml (1)

82-91: Inline container/steps for non‑PR job to satisfy actionlint (anchors here trip parser).

actionlint flags “container alias mapping expected” and “steps must be sequence” for the aliases. Duplicate the blocks inline for example-tests-non-pr.

   example-tests-non-pr:
     if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
     # Runner list at https://github.com/nv-gha-runners/enterprise-runner-configuration/blob/main/docs/runner-groups.md
-    runs-on: linux-amd64-gpu-h100-latest-1
+    runs-on: linux-amd64-gpu-h100-latest-1
     timeout-minutes: 120
     strategy:
       matrix:
         EXAMPLE: [llm_ptq]
-    container: *example_container
-    steps: *example_steps
+    container:
+      image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2
+      env:
+        LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}"
+        # PATH: "/usr/local/tensorrt/targets/x86_64-linux-gnu/bin:${PATH}"
+        PIP_CONSTRAINT: ""
+    steps:
+      - uses: actions/checkout@v4
+      - uses: nv-gha-runners/setup-proxy-cache@main
+      - name: Run example tests
+        run: |
+          pip install ".[all,dev-test]"
+          find examples/${{ matrix.EXAMPLE }} -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done
+          pytest -s tests/examples/${{ matrix.EXAMPLE }}
🧹 Nitpick comments (4)
.github/workflows/example_tests.yml (4)

63-63: actionlint “unknown runner label”: add linter config or include self‑hosted labels.

Either:

  • Add an actionlint config whitelisting the custom label; or
  • Use a label array including self-hosted + linux to be explicit.

Option A (add config file at .github/actionlint.yaml):

runnerLabels:
  - self-hosted
  - linux
  - linux-amd64-gpu-h100-latest-1

Option B (explicit labels):

-    runs-on: linux-amd64-gpu-h100-latest-1
+    runs-on: [self-hosted, linux, linux-amd64-gpu-h100-latest-1]

Run actionlint with the config to confirm it’s clean.

Also applies to: 85-85


69-74: LD_LIBRARY_PATH/PATH expansion in container.env won’t interpolate shell vars.

In the container env block, ${LD_LIBRARY_PATH} and ${PATH} are treated as literals, not expanded. Either drop the suffix or export/merge in a step.

Minimal safe fix: set in a step before installs so you preserve existing values:

     container: &example_container
       image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2
-      env:
-        LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}"
-        # PATH: "/usr/local/tensorrt/targets/x86_64-linux-gnu/bin:${PATH}"
-        PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages
+      env:
+        PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages
@@
     steps: &example_steps
       - uses: actions/checkout@v4
       - uses: nv-gha-runners/setup-proxy-cache@main
+      - name: Merge env paths
+        shell: bash
+        run: |
+          echo "LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}" >> "$GITHUB_ENV"
+          echo "PATH=/usr/local/tensorrt/targets/x86_64-linux-gnu/bin:${PATH}" >> "$GITHUB_ENV"

Apply the same change in the non‑PR job if you inline steps there.


36-48: Broaden change detection to catch packaging/test dependency edits.

Include pyproject.toml and example requirements variants so PR gating doesn’t miss impactful changes.

           files: |
             .github/workflows/example_tests.yml
             examples/llm_ptq/**
             modelopt/torch/**
             tests/examples/llm_ptq/**
             setup.py
+            pyproject.toml
+            examples/**/requirements*.txt

75-81: Harden steps: pin external actions, upgrade pip toolchain, fail fast.

  • Pin nv-gha-runners action to a commit SHA (avoid @main drift).
  • Upgrade pip/setuptools/wheel before installing.
  • Prefer python -m pip and set -euo pipefail.
       - uses: actions/checkout@v4
-      - uses: nv-gha-runners/setup-proxy-cache@main
+      - uses: nv-gha-runners/setup-proxy-cache@<commit-sha>
       - name: Run example tests
-        run: |
-          pip install ".[all,dev-test]"
-          find examples/${{ matrix.EXAMPLE }} -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done
-          pytest -s tests/examples/${{ matrix.EXAMPLE }}
+        shell: bash
+        run: |
+          set -euo pipefail
+          python -m pip install -U pip setuptools wheel
+          python -m pip install ".[all,dev-test]"
+          find examples/${{ matrix.EXAMPLE }} -name "requirements.txt" | while read req_file; do python -m pip install -r "$req_file" || exit 1; done
+          pytest -s tests/examples/${{ matrix.EXAMPLE }}

If you want, I can resolve the exact commit SHA and raise a follow-up PR to pin it.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c01a56f and 11835b1.

📒 Files selected for processing (8)
  • .github/workflows/code_quality.yml (1 hunks)
  • .github/workflows/example_tests.yml (1 hunks)
  • .github/workflows/gpu_tests.yml (0 hunks)
  • .github/workflows/pages.yml (1 hunks)
  • .github/workflows/unit_tests.yml (0 hunks)
  • .gitlab/tests.yml (1 hunks)
  • docker/Dockerfile (2 hunks)
  • tests/examples/speculative_decoding/test_medusa.py (0 hunks)
💤 Files with no reviewable changes (3)
  • .github/workflows/unit_tests.yml
  • .github/workflows/gpu_tests.yml
  • tests/examples/speculative_decoding/test_medusa.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • .github/workflows/pages.yml
  • .github/workflows/code_quality.yml
  • .gitlab/tests.yml
  • docker/Dockerfile
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/example_tests.yml

63-63: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


82-82: "steps" section is missing in job "example-tests-non-pr"

(syntax-check)


85-85: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


90-90: "container" section is alias node but mapping node is expected

(syntax-check)


91-91: "steps" section must be sequence node but got alias node with "" tag

(syntax-check)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (3)
.github/workflows/example_tests.yml (3)

12-16: Concurrency group LGTM.

The group key cleanly de-duplicates runs per PR branch vs. SHA; cancel-in-progress is appropriate.


49-59: Wait‑gate logic LGTM.

Gating on DCO and linux unit tests before GPU runs is sound; reusable workflow usage looks correct.


92-100: Required‑check guard LGTM with skip handling.

Condition correctly fails PR when relevant changes occur and GPU tests didn’t succeed; handles skipped paths.

Signed-off-by: Keval Morabia <[email protected]>
kevalmorabia97 force-pushed the kmorabia/dont-preinstall-docker-deps branch from b423dbb to 24db1b5 on September 18, 2025 21:04

coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/pruning/cifar_resnet.ipynb (1)

339-343: Bug: misuse of load_state_dict arguments.

load_state_dict doesn’t take a device argument; this passes device as strict. Also misses map_location on torch.load.

Apply:

-    model.load_state_dict(torch.load(ckpt_path), device)
+    state = torch.load(ckpt_path, map_location=device)
+    model.load_state_dict(state)
♻️ Duplicate comments (3)
docker/Dockerfile (2)

3-6: Invalid/overbroad TORCH_CUDA_ARCH_LIST; drop 12.0 and only +PTX on highest SM.

“12.0+PTX” is not recognized by many current PyTorch/CUDA combos and bloats build. Keep supported SMs; add +PTX only to the highest.

Apply:

-    TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0 12.0+PTX"
+    TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0+PTX"
+# If your toolchain truly supports SM 10.0, use:
+# TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0+PTX"

27-27: Avoid world-writable perms; use non-root user with ownership.

chmod -R 777 /workspace is a security smell and breaks shared volumes’ perms.

Apply:

-# Allow users to run without root
-RUN chmod -R 777 /workspace
+# Allow users to run without root
+RUN useradd -m -u 1000 appuser && \
+    chown -R appuser:appuser /workspace
+USER appuser
.github/workflows/example_tests.yml (1)

82-91: Fix YAML anchor usage for container/steps — actionlint errors; inline blocks.

Anchors here trigger “container alias mapping expected” and “steps must be sequence” in actionlint; inline the container and steps for the non-PR job.

Apply:

   example-tests-non-pr:
     if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
     # Runner list at https://github.com/nv-gha-runners/enterprise-runner-configuration/blob/main/docs/runner-groups.md
     runs-on: linux-amd64-gpu-h100-latest-1
     timeout-minutes: 90
     strategy:
       matrix:
         EXAMPLE: [llm_ptq]
-    container: *example_container
-    steps: *example_steps
+    container:
+      image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2
+      env:
+        LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib"
+        PIP_CONSTRAINT: ""
+    steps:
+      - uses: actions/checkout@v4
+      - uses: nv-gha-runners/setup-proxy-cache@main
+      - name: Run example tests
+        run: |
+          pip install ".[all,dev-test]"
+          find examples/${{ matrix.EXAMPLE }} -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done
+          pytest -s tests/examples/${{ matrix.EXAMPLE }}
🧹 Nitpick comments (5)
examples/pruning/cifar_resnet.ipynb (4)

25-33: Avoid unpinned, unconditional installs; gate with import checks and use %pip.

Unconditionally installing nvidia-modelopt can override the editable/source install in your Docker/CI and create version skew. Prefer conditional installs and Jupyter-native %pip; only install torchvision explicitly here.

Apply:

-! pip install nvidia-modelopt torchvision
+%pip install -q torchvision
+import importlib
+if importlib.util.find_spec("modelopt") is None:
+    %pip install -q nvidia-modelopt

60-61: Graceful CPU fallback for device selection.

torch.device("cuda") will hard-fail on CPU-only envs (e.g., CI preview, Colab T4 quota exhaustion).

Apply:

-device = torch.device("cuda")
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

287-288: Loss averaging mismatch.

epoch_loss sums per-batch mean losses; dividing by number of samples underestimates the value. Divide by number of batches or accumulate sum of per-sample losses.

Apply:

-    epoch_loss /= len(train_loader.dataset)
+    epoch_loss /= len(train_loader)
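
For the alternative mentioned above (accumulating per-sample loss sums), a sketch of the training loop assuming a criterion with the default mean reduction (helper name and signature are illustrative, not from the notebook):

def train_one_epoch(model, train_loader, criterion, optimizer, device):
    """Return the true average per-sample loss for one epoch."""
    model.train()
    epoch_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)  # default reduction="mean"
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * images.size(0)  # undo the mean: per-sample sum for this batch
    return epoch_loss / len(train_loader.dataset)  # correct even with a ragged last batch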

662-666: Typo in checkpoint filename (“seaarch”).

This will create inconsistent artifacts and breaks resume-by-name tooling.

Apply:

-        "checkpoint": "modelopt_seaarch_checkpoint_fastnas.pth",
+        "checkpoint": "modelopt_search_checkpoint_fastnas.pth",
.github/workflows/example_tests.yml (1)

71-74: Env var expansion in container.env won’t interpolate ${LD_LIBRARY_PATH}.

GitHub Actions doesn’t expand env references inside the container env block; the literal string is passed. Either set a fixed value or export in a run step.

Apply:

-      env:
-        LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}"
-        # PATH: "/usr/local/tensorrt/targets/x86_64-linux-gnu/bin:${PATH}"
-        PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages
+      env:
+        LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib"
+        PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages

Optionally append in the step:

- name: Append LD_LIBRARY_PATH
  run: echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib" >> $GITHUB_ENV
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11835b1 and 24db1b5.

📒 Files selected for processing (16)
  • .github/workflows/code_quality.yml (1 hunks)
  • .github/workflows/example_tests.yml (1 hunks)
  • .github/workflows/gpu_tests.yml (0 hunks)
  • .github/workflows/pages.yml (1 hunks)
  • .github/workflows/unit_tests.yml (0 hunks)
  • .gitlab/tests.yml (1 hunks)
  • README.md (1 hunks)
  • docker/Dockerfile (2 hunks)
  • docs/source/getting_started/_installation_for_Linux.rst (2 hunks)
  • examples/cnn_qat/requirements.txt (1 hunks)
  • examples/llm_eval/requirements.txt (1 hunks)
  • examples/onnx_ptq/docker/Dockerfile (1 hunks)
  • examples/onnx_ptq/requirements.txt (1 hunks)
  • examples/pruning/cifar_resnet.ipynb (1 hunks)
  • setup.py (1 hunks)
  • tests/examples/speculative_decoding/test_medusa.py (0 hunks)
💤 Files with no reviewable changes (3)
  • .github/workflows/gpu_tests.yml
  • .github/workflows/unit_tests.yml
  • tests/examples/speculative_decoding/test_medusa.py
✅ Files skipped from review due to trivial changes (1)
  • examples/cnn_qat/requirements.txt
🚧 Files skipped from review as they are similar to previous changes (9)
  • examples/onnx_ptq/requirements.txt
  • examples/llm_eval/requirements.txt
  • setup.py
  • .github/workflows/pages.yml
  • README.md
  • docs/source/getting_started/_installation_for_Linux.rst
  • .github/workflows/code_quality.yml
  • examples/onnx_ptq/docker/Dockerfile
  • .gitlab/tests.yml
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/example_tests.yml

63-63: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


82-82: "steps" section is missing in job "example-tests-non-pr"

(syntax-check)


85-85: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


90-90: "container" section is alias node but mapping node is expected

(syntax-check)


91-91: "steps" section must be sequence node but got alias node with "" tag

(syntax-check)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
🔇 Additional comments (2)
docker/Dockerfile (1)

20-25: Nice: local editable install + precompile improves dev ergonomics.

Please confirm TensorRT-Model-Optimizer[all] excludes example/test-only deps (e.g., torchvision) per PR goal; CI workflow installs dev-test separately.
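
One way to verify this in CI would be a metadata check along these lines (a sketch; the distribution name follows the notebook install cell above, and the extras-marker filtering is simplified):

# Sketch: confirm torchvision is not pulled in by the base install, only via
# extras or per-example requirements.
from importlib.metadata import requires

reqs = requires("nvidia-modelopt") or []
base_reqs = [r for r in reqs if "extra ==" not in r]  # drop extras-gated requirements
offenders = [r for r in base_reqs if r.lower().startswith("torchvision")]
assert not offenders, f"torchvision leaked into base deps: {offenders}"
print("Base dependencies look clean; torchvision is extras/example-only")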

.github/workflows/example_tests.yml (1)

79-81: Install order looks good; keep example deps runtime-only.

Repo + dev-test first, then example-specific requirements, then pytest — matches PR goals.

kevalmorabia97 merged commit 4c36abe into main on Sep 19, 2025 (27 checks passed).
kevalmorabia97 deleted the kmorabia/dont-preinstall-docker-deps branch on September 19, 2025 05:18.